source(here::here("script/unsupervised_learning/embedding/word_and_document_embedding.R"))

Word embedding

Word embedding was applied to our three review datasets: smartphone_reviews_final.csv, Apple_final.csv, and Samsung_final.csv. To embed the words, we used the word2vec::word2vec function with the CBOW (continuous bag-of-words) method. To visualize the results, we reduced the embeddings with uwot::umap and plotted them with the plotly package. In the following 2D and 3D graphs, only a subset of the words is plotted to keep the visualization readable.
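As a rough illustration of this pipeline, the sketch below trains a CBOW model on a few made-up sentences, projects the word vectors to three dimensions, and builds a 3D scatter plot. The toy reviews and the hyperparameter values (dim, window, iter, min_count, n_neighbors) are assumptions chosen for the example, not the settings used in the project script.

```r
library(word2vec)
library(uwot)
library(plotly)

set.seed(123)

# Made-up toy reviews (assumption); the real input would be the review text
# from the three datasets.
reviews <- tolower(c(
  "the battery life of this phone is great",
  "the camera of this phone is great",
  "the screen is bright and the battery is good",
  "bad battery and bad camera on this phone",
  "the phone has a good screen and a good camera",
  "great screen but the battery life is bad"
))

# Train a CBOW model; all numeric settings here are illustrative only.
model <- word2vec(
  x = reviews, type = "cbow",
  dim = 15, window = 3, iter = 20, min_count = 1, threads = 1
)

# One row per vocabulary word, one column per embedding dimension.
embedding <- as.matrix(model)

# Reduce the embedding to 3 dimensions for plotting.
coords <- umap(embedding, n_neighbors = 5, n_components = 3)

plot_df <- data.frame(
  word = rownames(embedding),
  x = coords[, 1], y = coords[, 2], z = coords[, 3]
)

# 3D scatter plot with each point labeled by its word.
fig <- plot_ly(
  plot_df, x = ~x, y = ~y, z = ~z, text = ~word,
  type = "scatter3d", mode = "markers+text"
)
```

Printing fig in an interactive session or a knitted document displays the plot. With a corpus this small the geometry is not meaningful; the point is only the sequence of calls.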

TODO DISPLAY head of DF resulting from word embedding

The script sourced above performs the following steps. It reads the word embedding models from three binary files: all_model.bin, apple_model.bin, and samsung_model.bin. It creates a 3D visualization of the word embeddings using the umap function from the uwot package and the plotly package. It preprocesses the text data by removing punctuation, lowercasing, and splitting the reviews into individual words, then embeds the documents using the document_embedding function and the embedded words matrix. It transforms the list of embedded documents into a matrix and removes its attributes. Finally, it saves the embedded words and documents to two CSV files: embedded_words.csv and embedded_documents.csv.
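The document-embedding steps can be sketched in base R as follows. The project's own document_embedding function is not shown here, so the embed_document helper below is a hypothetical stand-in that assumes a document vector is the mean of the vectors of its known words; the toy embedding matrix and reviews are made up as well.

```r
# Toy stand-in for the embedded-words matrix (assumption): rows are words,
# columns are embedding dimensions, values are made up.
embedded_words <- matrix(
  c(1, 0,
    0, 1,
    1, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("great", "battery", "phone"), c("dim1", "dim2"))
)

reviews <- c("Great phone!", "Battery... great battery.", "Unknown words only")

# Remove punctuation, lowercase, and split a review into individual words.
tokenize_review <- function(text) {
  cleaned <- tolower(gsub("[[:punct:]]", " ", text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: average the vectors of the words found in the
# vocabulary; return NA values when no word of the review is known.
embed_document <- function(text, embeddings) {
  words <- tokenize_review(text)
  known <- words[words %in% rownames(embeddings)]
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(embeddings)))
  }
  colMeans(embeddings[known, , drop = FALSE])
}

# Embed every review, bind the list into a matrix, and drop its attributes.
embedded_list <- lapply(reviews, embed_document, embeddings = embedded_words)
embedded_documents <- do.call(rbind, embedded_list)
dimnames(embedded_documents) <- NULL

# Save to a temporary location (the project writes embedded_documents.csv).
out_file <- file.path(tempdir(), "embedded_documents.csv")
write.csv(embedded_documents, out_file, row.names = FALSE)
```

Mean pooling is only one possible choice; it ignores word order and gives every known word equal weight, so the actual function may behave differently.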